Study of the Discrepancy Between Client- and Server-Side Clickstreams
Abstract
Web usage mining aims at characterizing user navigation on websites by applying machine learning techniques to server log data. In doing so, it is implicitly assumed that the requests a user makes to the server during one session capture the actual sequence of pages viewed by that user. In the literature, this page-view sequence is referred to as the “clickstream”. However, certain events, such as switching between tabs or navigating backwards in the browser history, occur exclusively on the client side, inducing a discrepancy between the user actions as recorded on the server side and the client-side clickstream. In addition, web proxy caches may answer client requests without relaying any information to the server. The focus of this work is to analyze how the complementary information in the client-side clickstream affects the outcome of a selected web usage mining algorithm. Should a notable performance boost be observed, a continuation of this work would be to learn or reconstruct the client-side clickstream from server-side information. To facilitate capture of the client-side clickstream, the ClickStreamRecorder (CSR) Firefox browser extension was developed and distributed. Realistic data was collected through the “Six Degrees of Kevin Bacon” online user study: participants were asked to install CSR and were given the task of establishing a connection between two randomly chosen actors on the imdb.com movie website. A support vector machine (SVM) was implemented with different string kernels to allow classification of the obtained clickstream data. Using the SVM, an attempt was made to predict successful completion of the user study task from both the client-side and the server-side clickstream. For the evaluated data, prediction based on the client-side data returned slightly better results, but classification accuracy was limited in general. A further learning task concerned the prediction of client-side events from the server-side clickstream; the significance of those classification results was also found to be limited. An overview of existing web usage mining techniques is included in this work. To the author’s knowledge, this is the first time the distinction between client- and server-side clickstreams has been considered and analyzed in this context.
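To make the classification setup concrete, the sketch below illustrates, under stated assumptions, the kind of pipeline the abstract describes: a client-side clickstream containing client-only events (e.g. back-button hits answered from the browser cache), the server-side view obtained by dropping those events, and an SVM with a simple k-spectrum string kernel over page-ID sequences. The event labels, page names, toy data, and the use of scikit-learn’s SVC are all assumptions made for illustration; the thesis implemented its own SVM and kernels, and the actual feature representation may differ.

```python
# Minimal sketch (not the thesis code): contrasting client- and server-side
# clickstreams and classifying them with a k-spectrum string kernel SVM.
# Page names, event labels and the toy data are hypothetical; scikit-learn's
# SVC stands in for the SVM the author implemented.
from collections import Counter

import numpy as np
from sklearn.svm import SVC

# A client-side clickstream: (page, source) events. 'cache' events (e.g. the
# back button answered from the browser cache) never reach the server log.
client_stream = [
    ("kevin_bacon", "server"),
    ("apollo_13", "server"),
    ("kevin_bacon", "cache"),   # back button: client-only event
    ("footloose", "server"),
]

def server_view(stream):
    """Approximate the server log by dropping client-only events."""
    return [page for page, source in stream if source == "server"]

def spectrum_kernel(s, t, k=2):
    """k-spectrum kernel: inner product of the k-gram count vectors."""
    grams_s = Counter(tuple(s[i:i + k]) for i in range(len(s) - k + 1))
    grams_t = Counter(tuple(t[i:i + k]) for i in range(len(t) - k + 1))
    return sum(grams_s[g] * grams_t[g] for g in grams_s.keys() & grams_t.keys())

def gram_matrix(rows, cols, k=2):
    return np.array([[spectrum_kernel(r, c, k) for c in cols] for r in rows])

# Toy training set: page-ID sequences labelled with task success (1) or
# failure (0), standing in for the recorded study clickstreams.
train = [["a", "b", "a", "c"], ["a", "c"], ["b", "b", "a"], ["c", "a", "b"]]
labels = [1, 0, 1, 0]

svm = SVC(kernel="precomputed")
svm.fit(gram_matrix(train, train), labels)

# Predict task completion from the server-side view of a new session.
test = [server_view(client_stream)]
print(svm.predict(gram_matrix(test, train)))
```

The precomputed-kernel interface is what makes string kernels convenient here: any kernel over sequences can be swapped in by supplying Gram matrices instead of fixed-length feature vectors.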
Summary

Study of differences in the logging of click sequences on the client and server side

Web usage mining aims to map how users navigate websites. This is done by applying tools from the field of machine learning to server log files. It is thereby implicitly assumed that requests to the server reflect the order in which a user visited web pages during a session. In the literature, this sequence is called the “clickstream”. However, a number of events are registered only on the client side, and in this way a difference arises between the server’s and the client’s clickstreams. This class of client events includes, among other things, switching between tabs or browser windows and navigating the browser cache, for example by pressing the back button. In addition, intermediate web proxy caches answer some requests without their ever reaching the actual server. This report is primarily devoted to determining how the complementary information that is available (only) in the client-side clickstream affects a selected web usage mining algorithm. If increased performance were observed, it would be worthwhile to try to reconstruct the client’s clickstream from server-side information. To enable collection of the client-side clickstream, ClickStreamRecorder (CSR), a Firefox plug-in, was developed. A user study was carried out on the website imdb.com: participants were asked to solve the “Six Degrees of Kevin Bacon” problem, which consists of finding a link between two randomly chosen actors. The author implemented a support vector machine and various string kernels to solve classification problems on the collected data. One goal was to predict whether a given clickstream indicates that the participant succeeded in solving the user study task. It turned out that classification improves only marginally when the client-side clickstream is used. A further learning problem addressed was determining the number of client events from the server-side clickstream; the results were of little significance. The report gives an overview of various methods in the field of web usage mining. To the best of the author’s knowledge, this is the first time the difference between client- and server-side clickstreams has been considered in this context.

Acknowledgements

This MSc thesis work was carried out at Fraunhofer IPSI. The author would like to thank Dr. Ulrike von Luxburg, Markus Weimer and Steffen Hartmann for supervising the work. Further, Jeong-Ho Chang’s help is greatly appreciated. Finally, a thank you to Corps Franconia, Darmstadt for their hospitality.
Publication date: 2006